Linear Models with Matrices

Author

Marnin Wolfe

Previously: Reproducible Projects in R

Lesson plan

My goals are to demonstrate:

  1. The underlying mechanics of linear fixed-effects models using matrix operations in R
  2. The use of lm() for ANOVA and regression

First we’ll use a small “toy” dataset, so the matrices are easy to see.

Then I’ll introduce a real dataset, give you a hint how to subset it, and turn you loose to analyze it yourself.

Steps we’ll do:

  1. Pose a question
  2. Write the model
  3. Estimate parameters with OLS
    • Solve “by hand” in R
    • Using the lm() function
  4. Partition variability: ANOVA
  5. Hypothesis testing: F-test, t-test - is the effect and/or the variance “significant”?

Simple example*

Our two favorite cultivars of clover:

  • Cultivar A, the 4-leaf 🍀
  • Cultivar B, the 3-leaf ☘️

Which one is really best?

We plant 8 plots. Four reps each cultivar.

What influences variation in clover biomass?

  1. Genetics? i.e. the cultivar identity (A vs. B)
  2. The average performance of clovers? An overall (or grand) mean.
  3. Error (random and otherwise)

*I’m borrowing from the excellent example by Dr. Felipe Ferrão here, also the example from Ch. 2 of Isik, Holland, and Maltecca (2017).

First, the model can be represented visually: as a per-plant equation, in statistical matrix notation, in words, and as a plot (Figure 1).

Figure 1: Visual representation of statistical models applied to genetics (from Ferrão 2024).
Important: Remember

“All models are wrong, but some are useful.” ~Box

We make some simplifying assumptions, always.

We are going to do hypothesis testing. For now, to determine if variance sources are contributing “significantly” to the data (F-test) and if the cultivar means are different (t-test).

This is an important part of some of our main downstream goals, like marker-trait association (GWAS).

Assumptions of linear models are:

  • Independence of errors (no ‘autocorrelation’)
  • Lack of multi-collinearity (predictors not too correlated)
  • Homoscedasticity (constant error variance)
  • Normality of errors
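As a quick illustration of the multicollinearity point (a sketch with made-up predictor values, not the clover data):

```r
# Two made-up predictors: x2 is almost an exact copy of x1
x1 <- c(1, 2, 3, 4, 5, 6, 7, 8)
x2 <- x1 + c(0.1, -0.1, 0.05, 0, -0.05, 0.1, 0, -0.1)

# A correlation near +/-1 between predictors signals multicollinearity:
# OLS cannot cleanly separate their effects, and standard errors inflate
cor(x1, x2)
```

With predictors this correlated, the individual coefficient estimates become unstable even though the overall fit may look fine.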

ANOVA with lm()

Two parts of Analysis of Variance

  1. Compute means
  2. Compute deviations (sums of squares, mean squares)
Figure 2: Computing means and sums of squares for ANOVA

And we can write the classical ANOVA table. Variance is partitioned into two components: Among Groups and Within Groups.

| Source of Variation | Degrees of Freedom (DF) | Sum of Squares (SS) | Mean Square (MS) | F-value |
|---|---|---|---|---|
| Among Groups | \(\alpha-1\) | \(SS_A=\sum_{i=1}^{\alpha} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2\) | \(MS_A = \frac{SS_A}{\alpha-1}\) | \(F=\frac{MS_A}{MS_E}\) |
| Within Groups | \(N-\alpha\) | \(SS_E = \sum_{i=1}^{\alpha} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2\) | \(MS_E = \frac{SS_E}{N-\alpha}\) | |
| Total | \(N-1\) | \(SS_T = \sum_{i=1}^{\alpha} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{\cdot\cdot})^2\) | | |

Where:

  • \(\alpha\) = number of groups
  • \(n_i\) = number of observations in group \(i\)
  • \(N\) = total number of observations
  • \(\bar{y}_{i\cdot}\) = mean of group \(i\)
  • \(\bar{y}_{\cdot\cdot}\) = grand mean

Our goal is to partition the total variance into sources.

We compare the variance among groups (e.g. among genotypes) to the residual variance using a ratio of variances (the F-value).

Basically, the F-test is looking for signal that stands out from the noise. A high F-value means the among-groups variance is large relative to the residual variance, so the grouping factor is probably important.
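To make the table concrete, the ANOVA quantities can be computed “by hand” for the clover data we create below:

```r
# Biomass values for the two cultivars (4 reps each), same as the data below
genoA <- c(14, 17, 15, 15)
genoB <- c(11, 13, 10, 12)

grand_mean <- mean(c(genoA, genoB))

# Among-group sum of squares: n_i * (group mean - grand mean)^2, summed
SS_A <- 4 * (mean(genoA) - grand_mean)^2 + 4 * (mean(genoB) - grand_mean)^2

# Within-group (residual) sum of squares: deviations from each group's mean
SS_E <- sum((genoA - mean(genoA))^2) + sum((genoB - mean(genoB))^2)

# Mean squares and the F-ratio (alpha = 2 groups, N = 8 observations)
MS_A <- SS_A / (2 - 1)
MS_E <- SS_E / (8 - 2)
F_value <- MS_A / MS_E

c(SS_A = SS_A, SS_E = SS_E, F = F_value)  # 28.125, 9.75, ~17.31
```

These match the aov() output shown later in the lesson exactly.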

library(tidyverse)
# Create an example
df <- data.frame(Replicate = 1:4,
                 GenoA = c(14, 17, 15, 15),
                 GenoB = c(11, 13, 10, 12)) 
df <- df %>% 
  # this will change the "wide" data frame to a "long" one
  ## it stacks the values of GenoA and GenoB into a column labelled "Biomass"
  ## and the col headers go to a new column "Genotype"
  pivot_longer(cols = c(GenoA, GenoB),
               names_to = "Genotype",
               values_to = "Biomass") %>% 
  # Make genotype and replicate into factors
    mutate(Genotype=as.factor(Genotype),
           # Replicate needs to be a factor-type
           # because it should be treated as a categorical NOT a numerical variable
           ### Replicate 4 doesn't come after Replicate 3 in any logical way.
           Replicate=as.factor(Replicate))


fit.lm <- lm(Biomass ~ Genotype, data = df)
summary(fit.lm)

Call:
lm(formula = Biomass ~ Genotype, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5000 -0.6875 -0.2500  0.7500  1.7500 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    15.2500     0.6374   23.93  3.5e-07 ***
GenotypeGenoB  -3.7500     0.9014   -4.16  0.00594 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.275 on 6 degrees of freedom
Multiple R-squared:  0.7426,    Adjusted R-squared:  0.6997 
F-statistic: 17.31 on 1 and 6 DF,  p-value: 0.005943
fit.aov<-aov(Biomass ~ Genotype, data = df)
summary(fit.aov)
            Df Sum Sq Mean Sq F value  Pr(>F)   
Genotype     1  28.12  28.125   17.31 0.00594 **
Residuals    6   9.75   1.625                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t.test(Biomass~Genotype,data = df)

    Welch Two Sample t-test

data:  Biomass by Genotype
t = 4.1603, df = 5.9961, p-value = 0.005951
alternative hypothesis: true difference in means between group GenoA and group GenoB is not equal to 0
95 percent confidence interval:
 1.544032 5.955968
sample estimates:
mean in group GenoA mean in group GenoB 
              15.25               11.50 

Notice the way the coefficients are reported from lm().

It shows only an effect for (Intercept) and GenotypeGenoB.

Initially, we wrote the model as

\[ y_{ij} = \mu + \tau_i + \epsilon_{ij} \]

where \(i\) indexes genotypes (1, 2) and \(j\) indexes replications (1, 2, 3, 4)

So our matrices and vectors looked like this:

\[ \begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{14} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{24} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{13} \\ \epsilon_{14} \\ \epsilon_{21} \\ \epsilon_{22} \\ \epsilon_{23} \\ \epsilon_{24} \end{bmatrix} \]

This specification implies 3 coefficients would be estimated (\(\mu\), \(\tau_1\) and \(\tau_2\)).

Parameter estimation with OLS

To see why only two coefficients are reported, let’s first calculate the OLS solution “by hand”.

Recall from class that linear models with fixed-effects and simple residuals can be solved with a closed form approach that minimizes the residual sum of squares (RSS). This is called the “Ordinary Least Squares” solution (OLS).

\[ \hat{\beta} = (X'X)^{-1}X'y \]

df
# A tibble: 8 × 3
  Replicate Genotype Biomass
  <fct>     <fct>      <dbl>
1 1         GenoA         14
2 1         GenoB         11
3 2         GenoA         17
4 2         GenoB         13
5 3         GenoA         15
6 3         GenoB         10
7 4         GenoA         15
8 4         GenoB         12
y<-df$Biomass

# Because R is too smart, we'll have to make the X matrix shown above for a demo

X<-matrix(c(1,1,1,1,1,1,1,1,
            1,0,1,0,1,0,1,0,
            0,1,0,1,0,1,0,1),
          ncol=3)
X
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    1    0    1
[3,]    1    1    0
[4,]    1    0    1
[5,]    1    1    0
[6,]    1    0    1
[7,]    1    1    0
[8,]    1    0    1

R provides a function called model.matrix() for constructing design matrices. It builds the same matrix that lm() and other software construct internally.

X1<-model.matrix(Biomass~Genotype,data = df)
X1
  (Intercept) GenotypeGenoB
1           1             0
2           1             1
3           1             0
4           1             1
5           1             0
6           1             1
7           1             0
8           1             1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$Genotype
[1] "contr.treatment"
XtX<-t(X)%*%X
Xty<-t(X)%*%y
beta_hat<-solve(XtX)%*%Xty
Error in `solve.default()`:
! Lapack routine dgesv: system is exactly singular: U[3,3] = 0

Singular = a square matrix that is not invertible

A related term is “Rank” which refers to the number of linearly independent columns of a matrix.

# What is the "rank" of XtX? Want it to be 3.
qr(XtX)$rank
[1] 2

The XtX matrix is not “full rank” because the columns of X are linearly dependent: the intercept column equals the sum of the two genotype indicator columns.

There aren’t enough degrees of freedom to estimate all three coefficients this way.
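We can verify the dependence directly in R (re-creating the over-parameterized X so the snippet stands alone):

```r
# The over-parameterized design matrix from above
X <- matrix(c(1, 1, 1, 1, 1, 1, 1, 1,
              1, 0, 1, 0, 1, 0, 1, 0,
              0, 1, 0, 1, 0, 1, 0, 1),
            ncol = 3)

# The intercept column equals the sum of the two genotype columns in every row
all(X[, 1] == X[, 2] + X[, 3])  # TRUE

# So X itself (like X'X) has rank 2, not 3
qr(X)$rank  # 2
```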

In R, functions like lm() will automatically constrain the model by contrasting levels against a reference (the first level of the factor).

So \(\tau_1\) is constrained to 0, and \(\tau_2\) is estimated as the deviation of GenoB from the reference level (GenoA).

\[ \begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{14} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{24} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_2 \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{13} \\ \epsilon_{14} \\ \epsilon_{21} \\ \epsilon_{22} \\ \epsilon_{23} \\ \epsilon_{24} \end{bmatrix} \]

beta_hat<-solve(t(X1)%*%X1)%*%(t(X1)%*%y)
beta_hat
               [,1]
(Intercept)   15.25
GenotypeGenoB -3.75
fit.lm$coefficients
  (Intercept) GenotypeGenoB 
        15.25         -3.75 

Ta da! 🎉
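With treatment contrasts, the coefficients map directly onto the group means: the intercept is the mean of the reference level (GenoA), and the GenotypeGenoB coefficient is the GenoB minus GenoA difference. A quick check with the raw data:

```r
# Biomass values from the toy dataset
genoA <- c(14, 17, 15, 15)
genoB <- c(11, 13, 10, 12)

mean(genoA)                # 15.25 -> (Intercept)
mean(genoB) - mean(genoA)  # -3.75 -> GenotypeGenoB
```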

Check assumptions

hist(residuals(fit.lm))
plot(x = fitted(fit.lm), y = residuals(fit.lm))
plot(fit.lm)
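To supplement the diagnostic plots, here is one numeric check you might add (a sketch; with only 8 observations, formal normality tests have low power, so treat the result cautiously):

```r
# Re-create the toy data and fit so the snippet stands alone
df <- data.frame(Genotype = factor(rep(c("GenoA", "GenoB"), each = 4)),
                 Biomass  = c(14, 17, 15, 15, 11, 13, 10, 12))
fit.lm <- lm(Biomass ~ Genotype, data = df)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
shapiro.test(residuals(fit.lm))
```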

Estimated marginal means

Our goal is often to compare the mean of different treatment groups, for example different genotypes.

You may have encountered “lsmeans” or “emmeans” in the literature, or if you use SAS.

Least squares means (LSMeans)—now more commonly called estimated marginal means (EMMs)—are model-based estimates of factor level means derived from a fitted linear model (LM). They are related to the “best linear unbiased estimators” (BLUEs) in a linear mixed model; we’ll get to all that later.

library(emmeans)
emmeans(fit.lm,"Genotype")
 Genotype emmean    SE df lower.CL upper.CL
 GenoA      15.2 0.637  6    13.69     16.8
 GenoB      11.5 0.637  6     9.94     13.1

Confidence level used: 0.95 

EMMs are not arithmetic means; they are predicted means from the fitted model. If there were additional factors in our model, the EMMs for Genotype would average over those other factors/covariates.
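For this one-factor model, you can reproduce the EMMs with predict() on the fitted model, one prediction per genotype level (re-creating the toy fit so the snippet stands alone):

```r
# Re-create the toy data and fit
df <- data.frame(Genotype = factor(rep(c("GenoA", "GenoB"), each = 4)),
                 Biomass  = c(14, 17, 15, 15, 11, 13, 10, 12))
fit.lm <- lm(Biomass ~ Genotype, data = df)

# Predicted mean for each genotype level matches the EMMs above
predict(fit.lm, newdata = data.frame(Genotype = c("GenoA", "GenoB")))
#     1     2
# 15.25 11.50
```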

Look at how they are related to the coefficients from the OLS model.

summary(fit.lm)

Call:
lm(formula = Biomass ~ Genotype, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5000 -0.6875 -0.2500  0.7500  1.7500 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    15.2500     0.6374   23.93  3.5e-07 ***
GenotypeGenoB  -3.7500     0.9014   -4.16  0.00594 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.275 on 6 degrees of freedom
Multiple R-squared:  0.7426,    Adjusted R-squared:  0.6997 
F-statistic: 17.31 on 1 and 6 DF,  p-value: 0.005943

Hands-on

Some real data

Download the CCB Crimson Clover ALT dataset from Canvas.

Put it in the data/ sub-directory of your project.

library(tidyverse)

Import the dataset

alt<-read.csv(here::here("data","19-25CC_ALT_DATA_ALL_2025-08-23.csv"))

str(alt)
'data.frame':   3058 obs. of  16 variables:
 $ X               : int  1 2 3 4 5 6 7 8 9 10 ...
 $ site            : chr  "AL" "AL" "AL" "AL" ...
 $ year            : int  24 24 24 24 24 24 24 24 24 24 ...
 $ site.year       : chr  "24AL" "24AL" "24AL" "24AL" ...
 $ unique.id       : chr  "24ALCC_ALT-01-01" "24ALCC_ALT-01-02" "24ALCC_ALT-01-03" "24ALCC_ALT-02-01" ...
 $ entry           : chr  "Aldo" "21NCCCLH" "23MDCC" "21NCCCE" ...
 $ rep             : int  1 1 1 1 1 1 1 1 1 1 ...
 $ row             : int  1 2 3 1 2 3 1 2 3 1 ...
 $ range           : int  1 1 1 2 2 2 3 3 3 4 ...
 $ biomass.1       : num  56.3 46.4 55.8 79.8 60.4 ...
 $ growth.stage.1  : num  9 7 7 9 8 8 9 9 8 7 ...
 $ fallstandcount  : int  NA NA NA NA NA NA NA NA NA NA ...
 $ springstandcount: int  95 95 95 95 95 75 75 95 75 95 ...
 $ fall.vigor.av   : num  5 5 5 5 5 9 5 5 9 5 ...
 $ spring.vigor.av : num  9 5 5 9 5 5 5 5 5 5 ...
 $ wintersurvival  : num  NA NA NA NA NA NA NA NA NA NA ...
alt %>% head
  X site year site.year        unique.id         entry rep row range biomass.1
1 1   AL   24      24AL 24ALCC_ALT-01-01          Aldo   1   1     1  56.33333
2 2   AL   24      24AL 24ALCC_ALT-01-02      21NCCCLH   1   2     1  46.41667
3 3   AL   24      24AL 24ALCC_ALT-01-03        23MDCC   1   3     1  55.75000
4 4   AL   24      24AL 24ALCC_ALT-02-01       21NCCCE   1   1     2  79.75000
5 5   AL   24      24AL 24ALCC_ALT-02-02 23NCCCE Elite   1   2     2  60.41667
6 6   AL   24      24AL 24ALCC_ALT-02-03         Dixie   1   3     2  57.16667
  growth.stage.1 fallstandcount springstandcount fall.vigor.av spring.vigor.av
1              9             NA               95             5               9
2              7             NA               95             5               5
3              7             NA               95             5               5
4              9             NA               95             5               9
5              8             NA               95             5               5
6              8             NA               75             9               5
  wintersurvival
1             NA
2             NA
3             NA
4             NA
5             NA
6             NA

About

The CCBN “Advanced Line Trial” (ALT) for Crimson Clover (Trifolium incarnatum) is a competitive variety trial. New “advanced line” bulks, created from top plants in CCBN nurseries, are compared against current best lines from other programs/years and commercial checks at up to 15 locations. Trials are conducted in randomized complete block designs with 15’ single-row plots flanked by triticale for competition; they assess biomass, vigor, winter survival, and seed yield.

Figure 3: Map of Cover Crop Breeding Network Testing Sites
Figure 4: Example of a CCB ALT Layout - 24ALCC_ALT
  • site: State Location Code
  • year: Year HARVESTED (trials are planted in Fall, harvested Spring of following year)
  • site.year: site-year combination
  • unique.id: unique plot ID
  • entry: Crimson Clover variety name.
  • rep: integer - specifies a complete block within a given site.year
  • row: Corresponds to one full planter pass. I’d prefer these to be called columns to match matrix conventions (row-by-column), but that’s how CCB records it (see Figure 4).
  • range: Spans a plot in the direction the planter drives. In matrix convention, this would be called the row 🤦 (see Figure 4).
  • biomass.1: plot dry mass in grams-per-linear-foot-harvested
  • growth.stage.1: Kalu & Fick growth stage at time of harvest
  • fall/springstandcount: percent of plot out of 100 with living plants, before/after winter.
  • fall/spring.vigor.av: visual rating of plot vigor scored on a time-specific 1-9 scale (i.e. 1 = worst, 5 = average, 9 = best).
  • wintersurvival: spring stand count divided by fall stand count.
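For example, the wintersurvival ratio can be re-derived from the two stand counts (a sketch on a made-up mini data frame, not the real ALT file):

```r
# Made-up stand counts for illustration only
mini <- data.frame(fallstandcount   = c(100, 80, 95),
                   springstandcount = c(90, 40, 95))

# wintersurvival = spring stand count / fall stand count
mini$wintersurvival <- mini$springstandcount / mini$fallstandcount
mini$wintersurvival  # 0.9 0.5 1.0
```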

Your task

  • Find a balanced chunk of the data.
  • Balance is going to mean a lack of missing data for most or all traits.
  • I suggest taking out 2-4 site.years worth.
  • Subset the dataset.
  • Pick a trait and follow the steps we did above.
  • Start simple with your modelling.

Questions:

  1. What are the potential sources of variation in the data? Write the model.
  2. What sources of variation are significant contributors to the model fit?
  3. Which are the “best” performing Crimson Clovers?

Data summarization

To find a good data chunk, you might want to make some summaries.

Here’s a nice tidy way of computing summary statistics on a dataset like this, using group_by() and summarize()

# Example
alt %>% 
  filter(site=="AL") %>% 
  group_by(site.year) %>% 
  summarize(PropMissingBiomass=length(which(is.na(biomass.1)))/n(),
            Nentries=length(unique(entry)),
            Nobs=n())
# A tibble: 2 × 4
  site.year PropMissingBiomass Nentries  Nobs
  <chr>                  <dbl>    <int> <int>
1 24AL                       0       28    56
2 25AL                       0       30    60
# Does the code for computing "PropMissingBiomass" look confusing? 
# To learn, break down each piece of the code and see what it does
# Like this
## alt$biomass.1 %>% is.na
## alt$biomass.1 %>% is.na %>% which
## length(which(is.na(alt$biomass.1)))/nrow(alt)

Having chosen the 2 AL site-years (for an example), I can split off a new data.frame for future use.

al_alt<-alt %>% 
  filter(site=="AL")

Plotting data

I like to use ggplot for plotting in R.

A great place to learn that is the Visualize chapter in R for Data Science (Lortie 2017a).

To show it off, I’ll make a histogram of the distribution of each trait in the dataset.

A bit of data transformation is needed, using pivot_longer to make a long-form dataset.

alt %>% 
  pivot_longer(cols = c(biomass.1:wintersurvival),
               names_to = "Trait",
               values_to = "Value") %>% 
  ggplot(.,aes(x=Value,fill=Trait)) + geom_density() + 
  facet_wrap(~Trait, scales='free')
Warning: Removed 4531 rows containing non-finite outside the scale range
(`stat_density()`).

Next

That’s it for today.

In the next session(s) we will introduce linear mixed models and related concepts. We’ll return with Hands-on Lesson 3 on Mixed Models and learn 3 packages (lme4::lmer(), sommer::mmes(), asreml()).

References

Isik, Fikret, James Holland, and Christian Maltecca. 2017. Genetic Data Analysis for Plant and Animal Breeding. Springer International Publishing. https://doi.org/10.1007/978-3-319-55177-7.
Lortie, Christopher. 2017a. “R for Data Science.” Journal of Statistical Software 77 (Book Review 1). https://doi.org/10.18637/jss.v077.b01.